This is to certify that the work I am submitting is my own. All external references and sources are clearly acknowledged and identified within the contents. I am aware of the University of Warwick regulation concerning plagiarism and collusion.
No substantial part(s) of the work submitted here has also been submitted by me in other assessments for accredited courses of study, and I acknowledge that if this has been done an appropriate reduction in the mark I might otherwise have received will be made.
The data set is provided by the Food Standards Agency. It contains the following variables:
| Variable | Description |
|---|---|
| Country | Country of the local authority |
| La_type | Type of local authority |
| La_name | Name of the local authority |
| Totalestablishments_includingnotyetrated_outside | Total number of establishments in the area, including those not yet rated and those outside the programme |
| establishmentsnotyetratedforintervention | Number of establishments not yet rated for intervention |
| establishmentsoutsidetheprogramme | Number of establishments that are not yet part of the programme |
| Total_percent_of_broadly_compliantestablishmentsrated_a_e | Percentage of establishments rated A to E that are broadly compliant |
| Total_percent_of_broadly_compliantestablishments_includingnotyetrated | Percentage of establishments, including those not yet rated, that are broadly compliant with the regulations |
| Aratedestablishments | Number of establishments in the area rated A (these have the greatest impact on public health) |
| Total_percent_of_broadly_compliantestablishments_a | Percentage of A-rated establishments that are broadly compliant |
| Bratedestablishments | Number of establishments in the area rated B |
| Total_percent_of_broadly_compliantestablishments_b | Percentage of B-rated establishments that are broadly compliant |
| Cratedestablishments | Number of establishments in the area rated C |
| Total_percent_of_broadly_compliantestablishments_c | Percentage of C-rated establishments that are broadly compliant |
| Dratedestablishments | Number of establishments in the area rated D |
| Total_percent_of_broadly_compliantestablishments_d | Percentage of D-rated establishments that are broadly compliant |
| Eratedestablishments | Number of establishments in the area rated E |
| Total_percent_of_broadly_compliantestablishments_e | Percentage of E-rated establishments that are broadly compliant |
| Total_percent_of_interventionsachieved_premisesrated_a_e | Percentage of interventions achieved across all premises rated A to E |
| Total_percent_of_interventionsachieved_premisesrated_a | Percentage of interventions achieved for premises rated A |
| Total_percent_of_interventionsachieved_premisesrated_b | Percentage of interventions achieved for premises rated B |
| Total_percent_of_interventionsachieved_premisesrated_c | Percentage of interventions achieved for premises rated C |
| Total_percent_of_interventionsachieved_premisesrated_d | Percentage of interventions achieved for premises rated D |
| Total_percent_of_interventionsachieved_premisesrated_e | Percentage of interventions achieved for premises rated E |
| Total_percent_of_interventionsachieved_premisesnotyetrated | Percentage of interventions achieved for premises not yet rated |
| Totalnumberofestablishmentssubjecttoformalenforcementactions_voluntaryclosure | Number of establishments subject to formal enforcement action in the form of voluntary closure |
| Totalnumberofestablishmentssubjecttoformalenforcementactions_seizure_detention_surrenderoffood | Number of establishments subject to formal enforcement action in the form of seizure, detention or surrender of food |
| Totalnumberofestablishmentssubjecttoformalenforcementactions_suspension_revocationofapprovalorlicence | Number of establishments subject to formal enforcement action in the form of suspension or revocation of their approval or licence |
| Totalnumberofestablishmentssubjecttoformalenforcementactions_hygieneemergencyprohibitionnotice | Number of establishments subject to formal enforcement action in the form of a Hygiene Emergency Prohibition Notice |
| Totalnumberofestablishmentssubjecttoformalenforcementactions_prohibitionorder | Number of establishments subject to formal enforcement action in the form of a prohibition order |
| Totalnumberofestablishmentssubjecttoformalenforcementactions_simplecaution | Number of establishments subject to formal enforcement action in the form of a simple caution |
| Totalnumberofestablishmentssubjecttoformalenforcementactions_hygieneimprovementnotices | Number of establishments subject to formal enforcement action in the form of hygiene improvement notices |
| Totalnumberofestablishmentssubjecttoformalenforcementactions_remedialaction_detentionnotices | Number of establishments subject to formal enforcement action in the form of remedial action or detention notices |
| Totalnumberofestablishmentssubjectto_writtenwarnings | Number of establishments subject to written warnings |
| Totalnumberofestablishmentssubjecttoformalenforcementactions_Prosecutionsconcluded | Number of establishments subject to formal enforcement action in the form of concluded prosecutions |
| Professional_full_time_equivalent_posts_occupied | Number of professional full-time equivalent posts currently occupied in the local authority |
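library(tidyverse)  # read_csv(), %>%, mutate(), replace_na(), ggplot()
library(janitor)    # clean_names()
library(Hmisc)      # rcorr()
food_hygeine <- read_csv('2019-20-enforcement-data-food-hygiene.csv')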
## Rows: 353 Columns: 36
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (5): Country, LAType, LAName, Total%ofBroadlyCompliantestablishments-A,...
## dbl (30): Totalestablishments(includingnotyetrated&outside), Establishmentsn...
## num (1): TotalnumberofestablishmentssubjecttoWrittenwarnings
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
food_hygeine <- food_hygeine %>% clean_names()
food_hygeine <- na.omit(food_hygeine)
food_hygeine$total_percent_of_interventionsachieved_premisesrated_a <- replace(food_hygeine$total_percent_of_interventionsachieved_premisesrated_a, food_hygeine$total_percent_of_interventionsachieved_premisesrated_a == 'NR', 0)
food_hygeine$total_percent_of_interventionsachieved_premisesrated_a = as.numeric(food_hygeine$total_percent_of_interventionsachieved_premisesrated_a)
food_hygeine$country = as.factor(food_hygeine$country)
food_hygeine$la_type = as.factor(food_hygeine$la_type)
food_hygeine$la_name = as.factor(food_hygeine$la_name)
#summary(food_hygeine)
# The summary shows 24 NAs in the interventions-achieved column for A-rated premises; we replace them with the mean of that column.
avg_premisesrateda <- mean(food_hygeine$total_percent_of_interventionsachieved_premisesrated_a, na.rm = TRUE)
food_hygeine <- replace_na(food_hygeine, list(total_percent_of_interventionsachieved_premisesrated_a = avg_premisesrateda))
#x1 <- na.omit(food_hygeine)
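The mean-replacement step can be illustrated on a toy vector. This is only a sketch under stated assumptions: the `raw` vector is hypothetical, and the sketch treats "NR" as missing before imputing, whereas the code above recodes "NR" as 0, so the two approaches give different column means.

```r
# Hypothetical raw column as read from a CSV: character, with an "NR" marker
raw <- c("100", "NR", "60", "80", NA)

# Assumption: treat "NR" as missing, then convert to numeric
values <- as.numeric(replace(raw, raw %in% "NR", NA))

# Replace every NA with the mean of the observed values
values[is.na(values)] <- mean(values, na.rm = TRUE)
values
```

Whether "NR" should count as zero or as missing depends on what the code means in the source data, which this document does not state.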
# `position` is not an aesthetic, so it is not mapped inside aes(); the bars stack by country as before
ggplot(data = food_hygeine, aes(x = total_percent_of_interventionsachieved_premisesrated_a_e, fill = country)) +
  geom_histogram(binwidth = 1) +
  labs(x = "Percentage %", y = "Count", title = "Distribution of Successful Enforcement Action Percentage for All Levels A-E")
#### 1.1.2 Distribution for level A-E separately
## Histograms of interventions achieved for each premises rating, A to E
ggplot(data = food_hygeine, aes(x = total_percent_of_interventionsachieved_premisesrated_a)) +
  geom_histogram(binwidth = 1, colour = "black", fill = "orange") +
  labs(x = "Interventions Achieved A (in %)", y = "Count", title = "Distribution of Successful Enforcement Action Percentage for Level A") +
  # One y scale sets both limits and breaks; a separate ylim() would only be overridden
  scale_y_continuous(limits = c(0, 400), breaks = seq(0, 400, 25), expand = c(0, 0))
ggplot(data = food_hygeine, aes(x = total_percent_of_interventionsachieved_premisesrated_b)) +
  geom_histogram(binwidth = 1, colour = "black", fill = "orange") +
  labs(x = "Interventions Achieved B (in %)", y = "Count", title = "Distribution of Successful Enforcement Action Percentage for Level B") +
  # ylim(0, 350) dropped: the y scale below replaced it, so it had no effect
  scale_y_continuous(breaks = seq(0, 350, 25), expand = c(0, 0))
ggplot(data = food_hygeine, aes(x = total_percent_of_interventionsachieved_premisesrated_c)) +
  geom_histogram(binwidth = 1, colour = "black", fill = "orange") +
  labs(x = "Interventions Achieved C (in %)", y = "Count", title = "Distribution of Successful Enforcement Action Percentage for Level C") +
  # ylim(0, 350) dropped: the y scale below replaced it, so it had no effect
  scale_y_continuous(breaks = seq(0, 200, 25), expand = c(0, 0))
ggplot(data = food_hygeine, aes(x = total_percent_of_interventionsachieved_premisesrated_d)) +
  geom_histogram(binwidth = 1, colour = "black", fill = "orange") +
  labs(x = "Interventions Achieved D (in %)", y = "Count", title = "Distribution of Successful Enforcement Action Percentage for Level D") +
  # ylim(0, 350) dropped: the y scale below replaced it, so it had no effect
  scale_y_continuous(breaks = seq(0, 350, 25), expand = c(0, 0))
ggplot(data = food_hygeine, aes(x = total_percent_of_interventionsachieved_premisesrated_e)) +
  geom_histogram(binwidth = 1, colour = "black", fill = "orange") +
  labs(x = "Interventions Achieved E (in %)", y = "Count", title = "Distribution of Successful Enforcement Action Percentage for Level E") +
  # ylim(0, 350) dropped: the y scale below replaced it, so it had no effect
  scale_y_continuous(breaks = seq(0, 350, 25), expand = c(0, 0))
food_hygeine_correlation <- rcorr(
  as.matrix(select(
    food_hygeine,
    total_percent_of_interventionsachieved_premisesrated_e,
    total_percent_of_interventionsachieved_premisesrated_d,
    total_percent_of_interventionsachieved_premisesrated_c,
    total_percent_of_interventionsachieved_premisesrated_b,
    total_percent_of_interventionsachieved_premisesrated_a,
    total_percent_of_interventionsachieved_premisesrated_a_e,
    professional_full_time_equivalent_posts_occupied
  )),
  type = "spearman"
)
food_hygeine_correlation
## total_percent_of_interventionsachieved_premisesrated_e
## total_percent_of_interventionsachieved_premisesrated_e 1.00
## total_percent_of_interventionsachieved_premisesrated_d 0.55
## total_percent_of_interventionsachieved_premisesrated_c 0.38
## total_percent_of_interventionsachieved_premisesrated_b 0.28
## total_percent_of_interventionsachieved_premisesrated_a 0.05
## total_percent_of_interventionsachieved_premisesrated_a_e 0.81
## professional_full_time_equivalent_posts_occupied 0.03
## total_percent_of_interventionsachieved_premisesrated_d
## total_percent_of_interventionsachieved_premisesrated_e 0.55
## total_percent_of_interventionsachieved_premisesrated_d 1.00
## total_percent_of_interventionsachieved_premisesrated_c 0.71
## total_percent_of_interventionsachieved_premisesrated_b 0.45
## total_percent_of_interventionsachieved_premisesrated_a 0.03
## total_percent_of_interventionsachieved_premisesrated_a_e 0.85
## professional_full_time_equivalent_posts_occupied -0.09
## total_percent_of_interventionsachieved_premisesrated_c
## total_percent_of_interventionsachieved_premisesrated_e 0.38
## total_percent_of_interventionsachieved_premisesrated_d 0.71
## total_percent_of_interventionsachieved_premisesrated_c 1.00
## total_percent_of_interventionsachieved_premisesrated_b 0.47
## total_percent_of_interventionsachieved_premisesrated_a 0.07
## total_percent_of_interventionsachieved_premisesrated_a_e 0.69
## professional_full_time_equivalent_posts_occupied -0.05
## total_percent_of_interventionsachieved_premisesrated_b
## total_percent_of_interventionsachieved_premisesrated_e 0.28
## total_percent_of_interventionsachieved_premisesrated_d 0.45
## total_percent_of_interventionsachieved_premisesrated_c 0.47
## total_percent_of_interventionsachieved_premisesrated_b 1.00
## total_percent_of_interventionsachieved_premisesrated_a 0.11
## total_percent_of_interventionsachieved_premisesrated_a_e 0.45
## professional_full_time_equivalent_posts_occupied 0.04
## total_percent_of_interventionsachieved_premisesrated_a
## total_percent_of_interventionsachieved_premisesrated_e 0.05
## total_percent_of_interventionsachieved_premisesrated_d 0.03
## total_percent_of_interventionsachieved_premisesrated_c 0.07
## total_percent_of_interventionsachieved_premisesrated_b 0.11
## total_percent_of_interventionsachieved_premisesrated_a 1.00
## total_percent_of_interventionsachieved_premisesrated_a_e 0.03
## professional_full_time_equivalent_posts_occupied 0.12
## total_percent_of_interventionsachieved_premisesrated_a_e
## total_percent_of_interventionsachieved_premisesrated_e 0.81
## total_percent_of_interventionsachieved_premisesrated_d 0.85
## total_percent_of_interventionsachieved_premisesrated_c 0.69
## total_percent_of_interventionsachieved_premisesrated_b 0.45
## total_percent_of_interventionsachieved_premisesrated_a 0.03
## total_percent_of_interventionsachieved_premisesrated_a_e 1.00
## professional_full_time_equivalent_posts_occupied 0.00
## professional_full_time_equivalent_posts_occupied
## total_percent_of_interventionsachieved_premisesrated_e 0.03
## total_percent_of_interventionsachieved_premisesrated_d -0.09
## total_percent_of_interventionsachieved_premisesrated_c -0.05
## total_percent_of_interventionsachieved_premisesrated_b 0.04
## total_percent_of_interventionsachieved_premisesrated_a 0.12
## total_percent_of_interventionsachieved_premisesrated_a_e 0.00
## professional_full_time_equivalent_posts_occupied 1.00
##
## n= 347
##
##
## P
## total_percent_of_interventionsachieved_premisesrated_e
## total_percent_of_interventionsachieved_premisesrated_e
## total_percent_of_interventionsachieved_premisesrated_d 0.0000
## total_percent_of_interventionsachieved_premisesrated_c 0.0000
## total_percent_of_interventionsachieved_premisesrated_b 0.0000
## total_percent_of_interventionsachieved_premisesrated_a 0.3941
## total_percent_of_interventionsachieved_premisesrated_a_e 0.0000
## professional_full_time_equivalent_posts_occupied 0.6158
## total_percent_of_interventionsachieved_premisesrated_d
## total_percent_of_interventionsachieved_premisesrated_e 0.0000
## total_percent_of_interventionsachieved_premisesrated_d
## total_percent_of_interventionsachieved_premisesrated_c 0.0000
## total_percent_of_interventionsachieved_premisesrated_b 0.0000
## total_percent_of_interventionsachieved_premisesrated_a 0.5895
## total_percent_of_interventionsachieved_premisesrated_a_e 0.0000
## professional_full_time_equivalent_posts_occupied 0.0852
## total_percent_of_interventionsachieved_premisesrated_c
## total_percent_of_interventionsachieved_premisesrated_e 0.0000
## total_percent_of_interventionsachieved_premisesrated_d 0.0000
## total_percent_of_interventionsachieved_premisesrated_c
## total_percent_of_interventionsachieved_premisesrated_b 0.0000
## total_percent_of_interventionsachieved_premisesrated_a 0.2053
## total_percent_of_interventionsachieved_premisesrated_a_e 0.0000
## professional_full_time_equivalent_posts_occupied 0.3737
## total_percent_of_interventionsachieved_premisesrated_b
## total_percent_of_interventionsachieved_premisesrated_e 0.0000
## total_percent_of_interventionsachieved_premisesrated_d 0.0000
## total_percent_of_interventionsachieved_premisesrated_c 0.0000
## total_percent_of_interventionsachieved_premisesrated_b
## total_percent_of_interventionsachieved_premisesrated_a 0.0505
## total_percent_of_interventionsachieved_premisesrated_a_e 0.0000
## professional_full_time_equivalent_posts_occupied 0.4371
## total_percent_of_interventionsachieved_premisesrated_a
## total_percent_of_interventionsachieved_premisesrated_e 0.3941
## total_percent_of_interventionsachieved_premisesrated_d 0.5895
## total_percent_of_interventionsachieved_premisesrated_c 0.2053
## total_percent_of_interventionsachieved_premisesrated_b 0.0505
## total_percent_of_interventionsachieved_premisesrated_a
## total_percent_of_interventionsachieved_premisesrated_a_e 0.5900
## professional_full_time_equivalent_posts_occupied 0.0211
## total_percent_of_interventionsachieved_premisesrated_a_e
## total_percent_of_interventionsachieved_premisesrated_e 0.0000
## total_percent_of_interventionsachieved_premisesrated_d 0.0000
## total_percent_of_interventionsachieved_premisesrated_c 0.0000
## total_percent_of_interventionsachieved_premisesrated_b 0.0000
## total_percent_of_interventionsachieved_premisesrated_a 0.5900
## total_percent_of_interventionsachieved_premisesrated_a_e
## professional_full_time_equivalent_posts_occupied 0.9771
## professional_full_time_equivalent_posts_occupied
## total_percent_of_interventionsachieved_premisesrated_e 0.6158
## total_percent_of_interventionsachieved_premisesrated_d 0.0852
## total_percent_of_interventionsachieved_premisesrated_c 0.3737
## total_percent_of_interventionsachieved_premisesrated_b 0.4371
## total_percent_of_interventionsachieved_premisesrated_a 0.0211
## total_percent_of_interventionsachieved_premisesrated_a_e 0.9771
## professional_full_time_equivalent_posts_occupied
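As background for reading the matrix above, a Spearman coefficient is a Pearson correlation computed on ranks. The sketch below illustrates this with base R only; the vectors `x` and `y` are hypothetical and are not columns from the data set.

```r
# Hypothetical toy data, not taken from the food hygiene data set
x <- c(10, 20, 30, 40, 50, 60)
y <- c(2, 1, 4, 4, 6, 5)

# Spearman's rho via the built-in method
rho_builtin <- cor(x, y, method = "spearman")

# Same quantity by hand: Pearson correlation of the ranks (ties get average ranks)
rho_by_hand <- cor(rank(x), rank(y), method = "pearson")

# The two values agree up to floating-point error
all.equal(rho_builtin, rho_by_hand)
```

Because only ranks enter the calculation, the coefficient is insensitive to the strongly left-skewed shape of the percentage variables.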
## Linear regression model for the relationship between the number of employees and the percentage of interventions achieved across all premises (A-E)
ggplot(food_hygeine, aes(y = total_percent_of_interventionsachieved_premisesrated_a_e, x = professional_full_time_equivalent_posts_occupied)) +
  geom_point() +
  labs(x = "No. of Employees", y = "All Successful Premises Enforcement Achieved (in percentage %)") +
  geom_smooth(method = "lm") +
  geom_jitter()
FullTimeEmployeesAtoE <- lm(total_percent_of_interventionsachieved_premisesrated_a_e ~ professional_full_time_equivalent_posts_occupied, data = food_hygeine)
FullTimeEmployeesAtoE
##
## Call:
## lm(formula = total_percent_of_interventionsachieved_premisesrated_a_e ~
## professional_full_time_equivalent_posts_occupied, data = food_hygeine)
##
## Coefficients:
## (Intercept)
## 87.1091
## professional_full_time_equivalent_posts_occupied
## -0.1195
summary(FullTimeEmployeesAtoE)
##
## Call:
## lm(formula = total_percent_of_interventionsachieved_premisesrated_a_e ~
## professional_full_time_equivalent_posts_occupied, data = food_hygeine)
##
## Residuals:
## Min 1Q Median 3Q Max
## -66.304 -4.575 4.067 8.658 13.860
##
## Coefficients:
## Estimate Std. Error t value
## (Intercept) 87.1091 1.2828 67.905
## professional_full_time_equivalent_posts_occupied -0.1195 0.2675 -0.447
## Pr(>|t|)
## (Intercept) <2e-16 ***
## professional_full_time_equivalent_posts_occupied 0.655
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 12.4 on 345 degrees of freedom
## Multiple R-squared: 0.0005787, Adjusted R-squared: -0.002318
## F-statistic: 0.1998 on 1 and 345 DF, p-value: 0.6552
cbind(coeffcient=coef(FullTimeEmployeesAtoE), confint(FullTimeEmployeesAtoE))
## coeffcient 2.5 %
## (Intercept) 87.1091495 84.5860343
## professional_full_time_equivalent_posts_occupied -0.1195469 -0.6456029
## 97.5 %
## (Intercept) 89.6322647
## professional_full_time_equivalent_posts_occupied 0.4065092
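The interval reported by `confint()` can be reproduced approximately from the printed summary as estimate ± t quantile × standard error. The sketch below is a rough check rather than the exact internal computation: it uses the rounded standard error from the output above, so the bounds should land close to, but not exactly on, the reported -0.646 and 0.407.

```r
# Values copied from the summary() output above
estimate  <- -0.1195469  # slope for professional_full_time_equivalent_posts_occupied
std_error <- 0.2675      # standard error as printed (rounded)
df_resid  <- 345         # residual degrees of freedom

# Two-sided 95% interval: estimate +/- t quantile * standard error
t_crit <- qt(0.975, df = df_resid)
ci <- estimate + c(-1, 1) * t_crit * std_error
ci
```

Because the interval contains zero, it agrees with the non-significant p-value of 0.655 for the slope.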
FullTimeEmployeesA <- lm(total_percent_of_interventionsachieved_premisesrated_a ~ professional_full_time_equivalent_posts_occupied, data = food_hygeine)
FullTimeEmployeesB <- lm(total_percent_of_interventionsachieved_premisesrated_b ~ professional_full_time_equivalent_posts_occupied, data = food_hygeine)
FullTimeEmployeesC <- lm(total_percent_of_interventionsachieved_premisesrated_c ~ professional_full_time_equivalent_posts_occupied, data = food_hygeine)
FullTimeEmployeesD <- lm(total_percent_of_interventionsachieved_premisesrated_d ~ professional_full_time_equivalent_posts_occupied, data = food_hygeine)
FullTimeEmployeesE <- lm(total_percent_of_interventionsachieved_premisesrated_e ~ professional_full_time_equivalent_posts_occupied, data = food_hygeine)
##Adjusted R-squared: -0.002318
##F-statistic: 0.1998 on 1 and 345 DF, p-value: 0.6552
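food_hygeine_new <- food_hygeine %>%
  mutate(
    # Rated establishments = all establishments minus those not yet rated and those outside the programme
    total_rated_establishments = totalestablishments_includingnotyetrated_outside -
      establishmentsnotyetratedforintervention -
      establishmentsoutsidetheprogramme,
    # Full-time equivalent posts per 100 rated establishments
    Employees_Proportion = round(professional_full_time_equivalent_posts_occupied / total_rated_establishments * 100, 2)
  )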
FTemployeesAtoE <- lm(total_percent_of_interventionsachieved_premisesrated_a_e ~ Employees_Proportion, data = food_hygeine_new)
summary(FTemployeesAtoE )
##
## Call:
## lm(formula = total_percent_of_interventionsachieved_premisesrated_a_e ~
## Employees_Proportion, data = food_hygeine_new)
##
## Residuals:
## Min 1Q Median 3Q Max
## -61.045 -5.352 4.403 8.379 15.901
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 79.512 1.932 41.149 < 2e-16 ***
## Employees_Proportion 24.142 6.180 3.907 0.000113 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 12.13 on 345 degrees of freedom
## Multiple R-squared: 0.04236, Adjusted R-squared: 0.03959
## F-statistic: 15.26 on 1 and 345 DF, p-value: 0.0001126
cbind(coeffcient=coef(FTemployeesAtoE ), confint(FTemployeesAtoE))
## coeffcient 2.5 % 97.5 %
## (Intercept) 79.51215 75.71161 83.31269
## Employees_Proportion 24.14159 11.98709 36.29608
##Adjusted R-squared: 0.03959
##F-statistic: 15.26 on 1 and 345 DF, p-value: 0.0001126
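To make the `Employees_Proportion` predictor concrete, the sketch below repeats the arithmetic for a single local authority, using the first values that `str(food_hygeine)` prints for each column. It then plugs the result into the printed coefficients of the model above. The lower-case variable names are illustrative only, and the fitted value is approximate because the coefficients are rounded.

```r
# First values of each column as printed by str(food_hygeine)
total_establishments <- 1478  # totalestablishments_includingnotyetrated_outside
not_yet_rated        <- 24    # establishmentsnotyetratedforintervention
outside_programme    <- 0     # establishmentsoutsidetheprogramme
posts                <- 5     # professional_full_time_equivalent_posts_occupied

# Same arithmetic as the mutate() call: posts per 100 rated establishments
total_rated <- total_establishments - not_yet_rated - outside_programme
employees_proportion <- round(posts / total_rated * 100, 2)

# Fitted A-E percentage from the printed intercept and slope
intercept <- 79.51215
slope     <- 24.14159
fitted_a_e <- intercept + slope * employees_proportion

c(total_rated = total_rated,
  employees_proportion = employees_proportion,
  fitted_a_e = fitted_a_e)
```

Read this way, the slope says that one additional full-time equivalent post per 100 rated establishments is associated with roughly 24 more percentage points of interventions achieved, although the adjusted R-squared of about 0.04 shows that the predictor explains little of the variation.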
FTemployeesforA <- lm(total_percent_of_interventionsachieved_premisesrated_a ~ Employees_Proportion, data = food_hygeine_new)
summary(FTemployeesforA)
##
## Call:
## lm(formula = total_percent_of_interventionsachieved_premisesrated_a ~
## Employees_Proportion, data = food_hygeine_new)
##
## Residuals:
## Min 1Q Median 3Q Max
## -91.482 8.411 8.696 8.802 9.105
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 90.771 4.138 21.935 <2e-16 ***
## Employees_Proportion 1.778 13.234 0.134 0.893
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 25.99 on 345 degrees of freedom
## Multiple R-squared: 5.232e-05, Adjusted R-squared: -0.002846
## F-statistic: 0.01805 on 1 and 345 DF, p-value: 0.8932
cbind(coeffcient=coef(FTemployeesforA), confint(FTemployeesforA))
## coeffcient 2.5 % 97.5 %
## (Intercept) 90.770882 82.63173 98.91003
## Employees_Proportion 1.778014 -24.25176 27.80779
##Adjusted R-squared: -0.002846
##F-statistic: 0.01805 on 1 and 345 DF, p-value: 0.8932
FTemployeesforB <- lm(total_percent_of_interventionsachieved_premisesrated_b ~ Employees_Proportion, data = food_hygeine_new)
summary(FTemployeesforB)
##
## Call:
## lm(formula = total_percent_of_interventionsachieved_premisesrated_b ~
## Employees_Proportion, data = food_hygeine_new)
##
## Residuals:
## Min 1Q Median 3Q Max
## -45.281 -1.647 2.574 4.616 4.938
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 94.893 1.089 87.175 <2e-16 ***
## Employees_Proportion 1.213 3.481 0.348 0.728
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.836 on 345 degrees of freedom
## Multiple R-squared: 0.0003518, Adjusted R-squared: -0.002546
## F-statistic: 0.1214 on 1 and 345 DF, p-value: 0.7277
cbind(coeffcient=coef(FTemployeesforB), confint(FTemployeesforB))
## coeffcient 2.5 % 97.5 %
## (Intercept) 94.892512 92.751533 97.033491
## Employees_Proportion 1.213003 -5.634052 8.060059
##Adjusted R-squared: -0.002546
##F-statistic: 0.1214 on 1 and 345 DF, p-value: 0.7277
FTemployeesforC <- lm(total_percent_of_interventionsachieved_premisesrated_c ~ Employees_Proportion, data = food_hygeine_new)
summary(FTemployeesforC)
##
## Call:
## lm(formula = total_percent_of_interventionsachieved_premisesrated_c ~
## Employees_Proportion, data = food_hygeine_new)
##
## Residuals:
## Min 1Q Median 3Q Max
## -71.059 -2.712 3.239 5.865 10.098
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 88.365 1.464 60.366 <2e-16 ***
## Employees_Proportion 11.814 4.681 2.524 0.0121 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.193 on 345 degrees of freedom
## Multiple R-squared: 0.01812, Adjusted R-squared: 0.01528
## F-statistic: 6.368 on 1 and 345 DF, p-value: 0.01207
cbind(coeffcient=coef(FTemployeesforC), confint(FTemployeesforC))
## coeffcient 2.5 % 97.5 %
## (Intercept) 88.36538 85.486223 91.24454
## Employees_Proportion 11.81402 2.606194 21.02185
##Adjusted R-squared: 0.01528
##F-statistic: 6.368 on 1 and 345 DF, p-value: 0.01207
FTemployeesforD <- lm(total_percent_of_interventionsachieved_premisesrated_d ~ Employees_Proportion, data = food_hygeine_new)
summary(FTemployeesforD)
##
## Call:
## lm(formula = total_percent_of_interventionsachieved_premisesrated_d ~
## Employees_Proportion, data = food_hygeine_new)
##
## Residuals:
## Min 1Q Median 3Q Max
## -67.764 -4.461 5.055 9.230 15.528
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 81.235 2.283 35.582 <2e-16 ***
## Employees_Proportion 17.281 7.301 2.367 0.0185 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 14.34 on 345 degrees of freedom
## Multiple R-squared: 0.01598, Adjusted R-squared: 0.01312
## F-statistic: 5.602 on 1 and 345 DF, p-value: 0.0185
cbind(coeffcient=coef(FTemployeesforD), confint(FTemployeesforD))
## coeffcient 2.5 % 97.5 %
## (Intercept) 81.23538 76.744986 85.72578
## Employees_Proportion 17.28058 2.919861 31.64130
##Adjusted R-squared: 0.01312
##F-statistic: 5.602 on 1 and 345 DF, p-value: 0.0185
FTemployeesforE <- lm(total_percent_of_interventionsachieved_premisesrated_e ~ Employees_Proportion, data = food_hygeine_new)
summary(FTemployeesforE)
##
## Call:
## lm(formula = total_percent_of_interventionsachieved_premisesrated_e ~
## Employees_Proportion, data = food_hygeine_new)
##
## Residuals:
## Min 1Q Median 3Q Max
## -76.912 -12.654 8.201 17.710 29.396
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 64.215 3.764 17.061 < 2e-16 ***
## Employees_Proportion 44.679 12.037 3.712 0.00024 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 23.64 on 345 degrees of freedom
## Multiple R-squared: 0.0384, Adjusted R-squared: 0.03561
## F-statistic: 13.78 on 1 and 345 DF, p-value: 0.0002398
cbind(coeffcient=coef(FTemployeesforE), confint(FTemployeesforE))
## coeffcient 2.5 % 97.5 %
## (Intercept) 64.21526 56.81236 71.61816
## Employees_Proportion 44.67894 21.00378 68.35411
##Adjusted R-squared: 0.03561
##F-statistic: 13.78 on 1 and 345 DF, p-value: 0.0002398
food_hygeine$total_percent_of_interventionsachieved_premisesrated_a = as.numeric(food_hygeine$total_percent_of_interventionsachieved_premisesrated_a)
str(food_hygeine)
## tibble [347 × 36] (S3: tbl_df/tbl/data.frame)
## $ country : Factor w/ 3 levels "England","Northern Ireland",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ la_type : Factor w/ 6 levels "District Council",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ la_name : Factor w/ 347 levels "Adur and Worthing",..: 1 2 3 8 9 10 11 12 16 17 ...
## $ totalestablishments_includingnotyetrated_outside : num [1:347] 1478 1316 1112 1208 905 ...
## $ establishmentsnotyetratedforintervention : num [1:347] 24 29 1 44 26 0 58 40 41 84 ...
## $ establishmentsoutsidetheprogramme : num [1:347] 0 74 0 1 1 0 214 39 0 42 ...
## $ total_percent_of_broadly_compliantestablishmentsrated_a_e : num [1:347] 97.2 97.2 97.5 97.7 96.7 ...
## $ total_percent_of_broadly_compliantestablishments_includingnotyetrated : num [1:347] 95.6 94.9 97.4 94.1 93.9 ...
## $ aratedestablishments : num [1:347] 3 2 2 3 1 5 1 4 1 4 ...
## $ total_percent_of_broadly_compliantestablishments_a : chr [1:347] "33.33" "50" "50" "0" ...
## $ bratedestablishments : num [1:347] 39 26 39 28 31 15 20 44 31 36 ...
## $ total_percent_of_broadly_compliantestablishments_b : num [1:347] 69.2 76.9 64.1 82.1 77.4 ...
## $ cratedestablishments : num [1:347] 227 243 179 211 145 125 270 219 96 190 ...
## $ total_percent_of_broadly_compliantestablishments_c : num [1:347] 91.2 90.1 93.8 94.3 89.7 ...
## $ dratedestablishments : num [1:347] 592 469 432 483 353 453 555 626 186 519 ...
## $ total_percent_of_broadly_compliantestablishments_d : num [1:347] 99 99.4 99.5 98.5 98.3 ...
## $ eratedestablishments : num [1:347] 593 473 459 438 348 534 628 1030 219 525 ...
## $ total_percent_of_broadly_compliantestablishments_e : num [1:347] 99.8 100 100 100 100 ...
## $ total_percent_of_interventionsachieved_premisesrated_a_e : num [1:347] 96.1 90.6 88.9 94 80.7 ...
## $ total_percent_of_interventionsachieved_premisesrated_a : num [1:347] 100 100 100 100 60 80 100 100 100 100 ...
## $ total_percent_of_interventionsachieved_premisesrated_b : num [1:347] 100 98.3 95.1 96.3 100 ...
## $ total_percent_of_interventionsachieved_premisesrated_c : num [1:347] 95.5 89.7 97 94.4 78.8 ...
## $ total_percent_of_interventionsachieved_premisesrated_d : num [1:347] 96 93 91.8 92.6 85.3 ...
## $ total_percent_of_interventionsachieved_premisesrated_e : num [1:347] 94 85.1 72.3 95.5 68.3 ...
## $ total_percent_of_interventionsachieved_premisesnotyetrated : num [1:347] 100 100 100 95.4 79.6 ...
## $ totalnumberofestablishmentssubjecttoformalenforcementactions_voluntaryclosure : num [1:347] 5 0 0 2 1 0 0 0 0 0 ...
## $ totalnumberofestablishmentssubjecttoformalenforcementactions_seizure_detention_surrenderoffood : num [1:347] 4 0 0 0 0 0 0 0 0 0 ...
## $ totalnumberofestablishmentssubjecttoformalenforcementactions_suspension_revocationofapprovalorlicence: num [1:347] 0 0 0 0 0 0 1 0 0 0 ...
## $ totalnumberofestablishmentssubjecttoformalenforcementactions_hygieneemergencyprohibitionnotice : num [1:347] 0 0 0 0 0 0 0 0 0 0 ...
## $ totalnumberofestablishmentssubjecttoformalenforcementactions_prohibitionorder : num [1:347] 0 0 0 0 0 0 1 0 0 0 ...
## $ totalnumberofestablishmentssubjecttoformalenforcementactions_simplecaution : num [1:347] 0 0 1 0 0 0 0 0 0 0 ...
## $ totalnumberofestablishmentssubjecttoformalenforcementactions_hygieneimprovementnotices : num [1:347] 3 6 11 3 4 0 3 2 1 2 ...
## $ totalnumberofestablishmentssubjecttoformalenforcementactions_remedialaction_detentionnotices : num [1:347] 0 0 0 0 0 0 0 0 0 0 ...
## $ totalnumberofestablishmentssubjectto_writtenwarnings : num [1:347] 323 413 515 386 252 224 223 152 179 175 ...
## $ totalnumberofestablishmentssubjecttoformalenforcementactions_prosecutionsconcluded : num [1:347] 0 0 1 0 0 0 0 2 0 0 ...
## $ professional_full_time_equivalent_posts_occupied : num [1:347] 5 4 3.5 4 2 4.65 2.5 5 2 4.2 ...
## - attr(*, "na.action")= 'omit' Named int [1:6] 21 36 52 88 163 261
## ..- attr(*, "names")= chr [1:6] "21" "36" "52" "88" ...
The first figure shows the distribution of interventions achieved across all establishments rated A to E in the three countries: England, Northern Ireland and Wales. Most local authorities achieve between 90% and 100% of their interventions, which indicates that enforcement efficiency is high.
We also plotted a separate histogram for each rating from A to E to show how efficiency varies by rating. These histograms suggest that the intervention success rate is highest for A-rated establishments, followed by B, C, D and E. Every histogram peaks at its right-hand end, which means that more than 75% of the local authorities are successful in implementing the interventions.
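Statements about the share of local authorities above a given level can be checked numerically rather than read off a histogram. The sketch below shows one way to do that; the `achieved` vector and the helper `share_at_least()` are hypothetical and do not come from the data set.

```r
# Hypothetical percentages of interventions achieved (not real data)
achieved <- c(100, 98.3, 95.1, 60, 88.9, 100, 72.3, 94.0)

# Share of authorities (in %) achieving at least a given threshold
share_at_least <- function(x, threshold) mean(x >= threshold) * 100

share_at_least(achieved, 90)
share_at_least(achieved, 75)
```

With the real data, the same helper could be applied to a column such as `food_hygeine$total_percent_of_interventionsachieved_premisesrated_a_e`.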
## total_percent_of_interventionsachieved_premisesrated_e
## total_percent_of_interventionsachieved_premisesrated_e 1.00
## total_percent_of_interventionsachieved_premisesrated_d 0.55
## total_percent_of_interventionsachieved_premisesrated_c 0.38
## total_percent_of_interventionsachieved_premisesrated_b 0.28
## total_percent_of_interventionsachieved_premisesrated_a 0.05
## total_percent_of_interventionsachieved_premisesrated_a_e 0.81
## professional_full_time_equivalent_posts_occupied 0.03
## total_percent_of_interventionsachieved_premisesrated_d
## total_percent_of_interventionsachieved_premisesrated_e 0.55
## total_percent_of_interventionsachieved_premisesrated_d 1.00
## total_percent_of_interventionsachieved_premisesrated_c 0.71
## total_percent_of_interventionsachieved_premisesrated_b 0.45
## total_percent_of_interventionsachieved_premisesrated_a 0.03
## total_percent_of_interventionsachieved_premisesrated_a_e 0.85
## professional_full_time_equivalent_posts_occupied -0.09
## total_percent_of_interventionsachieved_premisesrated_c
## total_percent_of_interventionsachieved_premisesrated_e 0.38
## total_percent_of_interventionsachieved_premisesrated_d 0.71
## total_percent_of_interventionsachieved_premisesrated_c 1.00
## total_percent_of_interventionsachieved_premisesrated_b 0.47
## total_percent_of_interventionsachieved_premisesrated_a 0.07
## total_percent_of_interventionsachieved_premisesrated_a_e 0.69
## professional_full_time_equivalent_posts_occupied -0.05
## total_percent_of_interventionsachieved_premisesrated_b
## total_percent_of_interventionsachieved_premisesrated_e 0.28
## total_percent_of_interventionsachieved_premisesrated_d 0.45
## total_percent_of_interventionsachieved_premisesrated_c 0.47
## total_percent_of_interventionsachieved_premisesrated_b 1.00
## total_percent_of_interventionsachieved_premisesrated_a 0.11
## total_percent_of_interventionsachieved_premisesrated_a_e 0.45
## professional_full_time_equivalent_posts_occupied 0.04
## total_percent_of_interventionsachieved_premisesrated_a
## total_percent_of_interventionsachieved_premisesrated_e 0.05
## total_percent_of_interventionsachieved_premisesrated_d 0.03
## total_percent_of_interventionsachieved_premisesrated_c 0.07
## total_percent_of_interventionsachieved_premisesrated_b 0.11
## total_percent_of_interventionsachieved_premisesrated_a 1.00
## total_percent_of_interventionsachieved_premisesrated_a_e 0.03
## professional_full_time_equivalent_posts_occupied 0.12
## total_percent_of_interventionsachieved_premisesrated_a_e
## total_percent_of_interventionsachieved_premisesrated_e 0.81
## total_percent_of_interventionsachieved_premisesrated_d 0.85
## total_percent_of_interventionsachieved_premisesrated_c 0.69
## total_percent_of_interventionsachieved_premisesrated_b 0.45
## total_percent_of_interventionsachieved_premisesrated_a 0.03
## total_percent_of_interventionsachieved_premisesrated_a_e 1.00
## professional_full_time_equivalent_posts_occupied 0.00
## professional_full_time_equivalent_posts_occupied
## total_percent_of_interventionsachieved_premisesrated_e 0.03
## total_percent_of_interventionsachieved_premisesrated_d -0.09
## total_percent_of_interventionsachieved_premisesrated_c -0.05
## total_percent_of_interventionsachieved_premisesrated_b 0.04
## total_percent_of_interventionsachieved_premisesrated_a 0.12
## total_percent_of_interventionsachieved_premisesrated_a_e 0.00
## professional_full_time_equivalent_posts_occupied 1.00
##
## n= 347
##
##
## P
## total_percent_of_interventionsachieved_premisesrated_e
## total_percent_of_interventionsachieved_premisesrated_e
## total_percent_of_interventionsachieved_premisesrated_d 0.0000
## total_percent_of_interventionsachieved_premisesrated_c 0.0000
## total_percent_of_interventionsachieved_premisesrated_b 0.0000
## total_percent_of_interventionsachieved_premisesrated_a 0.3941
## total_percent_of_interventionsachieved_premisesrated_a_e 0.0000
## professional_full_time_equivalent_posts_occupied 0.6158
## total_percent_of_interventionsachieved_premisesrated_d
## total_percent_of_interventionsachieved_premisesrated_e 0.0000
## total_percent_of_interventionsachieved_premisesrated_d
## total_percent_of_interventionsachieved_premisesrated_c 0.0000
## total_percent_of_interventionsachieved_premisesrated_b 0.0000
## total_percent_of_interventionsachieved_premisesrated_a 0.5895
## total_percent_of_interventionsachieved_premisesrated_a_e 0.0000
## professional_full_time_equivalent_posts_occupied 0.0852
## total_percent_of_interventionsachieved_premisesrated_c
## total_percent_of_interventionsachieved_premisesrated_e 0.0000
## total_percent_of_interventionsachieved_premisesrated_d 0.0000
## total_percent_of_interventionsachieved_premisesrated_c
## total_percent_of_interventionsachieved_premisesrated_b 0.0000
## total_percent_of_interventionsachieved_premisesrated_a 0.2053
## total_percent_of_interventionsachieved_premisesrated_a_e 0.0000
## professional_full_time_equivalent_posts_occupied 0.3737
## total_percent_of_interventionsachieved_premisesrated_b
## total_percent_of_interventionsachieved_premisesrated_e 0.0000
## total_percent_of_interventionsachieved_premisesrated_d 0.0000
## total_percent_of_interventionsachieved_premisesrated_c 0.0000
## total_percent_of_interventionsachieved_premisesrated_b
## total_percent_of_interventionsachieved_premisesrated_a 0.0505
## total_percent_of_interventionsachieved_premisesrated_a_e 0.0000
## professional_full_time_equivalent_posts_occupied 0.4371
## total_percent_of_interventionsachieved_premisesrated_a
## total_percent_of_interventionsachieved_premisesrated_e 0.3941
## total_percent_of_interventionsachieved_premisesrated_d 0.5895
## total_percent_of_interventionsachieved_premisesrated_c 0.2053
## total_percent_of_interventionsachieved_premisesrated_b 0.0505
## total_percent_of_interventionsachieved_premisesrated_a
## total_percent_of_interventionsachieved_premisesrated_a_e 0.5900
## professional_full_time_equivalent_posts_occupied 0.0211
## total_percent_of_interventionsachieved_premisesrated_a_e
## total_percent_of_interventionsachieved_premisesrated_e 0.0000
## total_percent_of_interventionsachieved_premisesrated_d 0.0000
## total_percent_of_interventionsachieved_premisesrated_c 0.0000
## total_percent_of_interventionsachieved_premisesrated_b 0.0000
## total_percent_of_interventionsachieved_premisesrated_a 0.5900
## total_percent_of_interventionsachieved_premisesrated_a_e
## professional_full_time_equivalent_posts_occupied 0.9771
## professional_full_time_equivalent_posts_occupied
## total_percent_of_interventionsachieved_premisesrated_e 0.6158
## total_percent_of_interventionsachieved_premisesrated_d 0.0852
## total_percent_of_interventionsachieved_premisesrated_c 0.3737
## total_percent_of_interventionsachieved_premisesrated_b 0.4371
## total_percent_of_interventionsachieved_premisesrated_a 0.0211
## total_percent_of_interventionsachieved_premisesrated_a_e 0.9771
## professional_full_time_equivalent_posts_occupied
This scatter plot shows the relationship between the overall percentage of successful interventions for establishments rated A to E and the number of professional full-time equivalent posts occupied. What we can derive from it is that the relationship is very weak and that there is no significant link between the two variables. The correlation output above confirms this (r = 0.00, p = 0.977). Thus, we can conclude that hiring more employees or professionals does not appear to have any major impact on the success of interventions.
For Objective 3 we have used a statistical measure called r, the correlation coefficient, which describes how two variables are linked or related to each other. If r lies between 0 and -1 there is a negative relationship, and if r lies between 0 and 1 there is a positive relationship. Here the correlation is 0.23, which is positive. A second statistical measure, the p-value, tells us whether the correlation is significant. Here it is 0.0001126, which is below 0.05, so we can say the correlation is significant. With this conclusion we can say that the number of employees should be increased so that the success rate of interventions in establishments of local authorities could be improved.
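As a minimal sketch of how r and its p-value are obtained in R, the example below uses made-up vectors (`staff`, `success_up`, `success_down` and `success_noisy` are hypothetical names, not columns of the dataset). It only illustrates the sign of r and the use of `cor.test()`; it does not reproduce the values reported above.

```r
# Made-up data for illustration only; not the Food Standards Agency data
staff <- 1:10

# A perfectly increasing relationship gives r = 1
success_up <- 2 * staff + 1
r_up <- cor(staff, success_up)

# A perfectly decreasing relationship gives r = -1
success_down <- 50 - 3 * staff
r_down <- cor(staff, success_down)

# A noisy relationship: cor.test() returns both r and the p-value
set.seed(1)
success_noisy <- 0.5 * staff + rnorm(10)
res <- cor.test(staff, success_noisy, method = "pearson")
res$estimate   # the correlation coefficient r
res$p.value    # compared against 0.05 to judge significance
```

The p-value answers whether a correlation of that size could plausibly arise by chance; it does not measure how strong the relationship is.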
title: "Question 2 Section 1"
author: "u2288495"
date: "2023-12-14"
output: html_document
**Data Dictionary**
This dataset is provided by a Publishing Company. The variables in the dataset are as follows:
| Variable | Description |
|---|---|
| sold_by | Name of the publisher selling the book |
| publisher_type | Type of the publisher selling the book |
| genre | Genre of the book |
| avg_review | Average review score given to the book by readers |
| daily_sales | Average number of sales (minus refunds) over the whole period |
| total_reviews | Total number of reviews given to the book by readers |
| sale_price | Average price at which the book was sold over the period |
salesofbooks <- read_csv('publisher_sales.csv')
## Rows: 6000 Columns: 7
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (3): sold by, publisher.type, genre
## dbl (4): avg.review, daily.sales, total.reviews, sale.price
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
salesofbooks <- salesofbooks %>% clean_names()
#summary(salesofbooks)
salesofbooks$genre <- as.factor(salesofbooks$genre)
mean.dailySales.genre <- salesofbooks %>% group_by(genre) %>% summarise(daily_sales=mean(daily_sales))
mean.dailySales.genre
## daily_sales
## 1 79.10967
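The output above is a single overall mean (79.1) rather than one row per genre. One possible cause, which is an assumption here because the library calls are not shown in this section, is that `summarise()` resolved to a function of the same name from another attached package (plyr is a common example) instead of dplyr. A minimal sketch on made-up toy data shows how calling the dplyr verbs with an explicit namespace gives one mean per group:

```r
library(dplyr)

# Toy data standing in for salesofbooks (values are made up for illustration)
toy <- data.frame(
  genre = factor(c("childrens", "childrens", "fiction",
                   "fiction", "non_fiction", "non_fiction")),
  daily_sales = c(50, 60, 100, 110, 70, 80)
)

# The explicit dplyr:: prefix means masking by another package cannot interfere
mean_by_genre <- toy %>%
  dplyr::group_by(genre) %>%
  dplyr::summarise(daily_sales = mean(daily_sales))

mean_by_genre
```

If the same prefix were applied to the call on `salesofbooks`, the result would be expected to have three rows, one per genre, matching the estimated means reported by `emmeans()` further down.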
m.dailySales.by.genre <- lm(daily_sales~genre, data=salesofbooks)
anova(m.dailySales.by.genre)
## Analysis of Variance Table
##
## Response: daily_sales
## Df Sum Sq Mean Sq F value Pr(>F)
## genre 2 2562528 1281264 2590.5 < 2.2e-16 ***
## Residuals 5997 2966133 495
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(m.dailySales.by.genre)
##
## Call:
## lm(formula = daily_sales ~ genre, data = salesofbooks)
##
## Residuals:
## Min 1Q Median 3Q Max
## -102.396 -13.326 -0.076 13.249 102.094
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 55.5773 0.4973 111.76 <2e-16 ***
## genrefiction 50.3087 0.7033 71.53 <2e-16 ***
## genrenon_fiction 20.2886 0.7033 28.85 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 22.24 on 5997 degrees of freedom
## Multiple R-squared: 0.4635, Adjusted R-squared: 0.4633
## F-statistic: 2590 on 2 and 5997 DF, p-value: < 2.2e-16
( m.dailySales.by.genre.emm <- emmeans(m.dailySales.by.genre, ~genre) )
## genre emmean SE df lower.CL upper.CL
## childrens 55.6 0.497 5997 54.6 56.6
## fiction 105.9 0.497 5997 104.9 106.9
## non_fiction 75.9 0.497 5997 74.9 76.8
##
## Confidence level used: 0.95
( m.dailySales.by.genre.contruct <- confint(pairs(m.dailySales.by.genre.emm)) )
## contrast estimate SE df lower.CL upper.CL
## childrens - fiction -50.3 0.703 5997 -52.0 -48.7
## childrens - non_fiction -20.3 0.703 5997 -21.9 -18.6
## fiction - non_fiction 30.0 0.703 5997 28.4 31.7
##
## Confidence level used: 0.95
## Conf-level adjustment: tukey method for comparing a family of 3 estimates
# Plot the estimated means and the pairwise differences, each with 95% CIs
ci <- ggplot(summary(m.dailySales.by.genre.emm), aes(x = genre, y = emmean, ymin = lower.CL, ymax = upper.CL)) +
  geom_point() +
  geom_linerange() +
  labs(y = "Daily Sales", x = "Genre", subtitle = "Error bars are 95% CIs", title = "Daily Sales") +
  ylim(50, 110) +
  coord_flip()
d.ci <- ggplot(m.dailySales.by.genre.contruct, aes(x = contrast, y = estimate, ymin = lower.CL, ymax = upper.CL)) +
  geom_point() +
  geom_linerange() +
  labs(y = "Difference in Daily Sales", x = "Contrast", subtitle = "Error bars are 95% CIs", title = "Difference in Daily Sales") +
  ylim(-55, 35) +
  coord_flip()
# Stack the two plots vertically
grid.arrange(ci, d.ci, nrow = 2)
ggplot(data = salesofbooks, aes(x = daily_sales, color = genre)) + geom_histogram(binwidth = 1) + xlim(-10,200) + labs(title = "Daily Sales against Genre of Books")
## Checks the distribution of daily sales for each genre; here the childrens genre outnumbers fiction and non-fiction
ggplot(data = salesofbooks, aes(x = sale_price)) + geom_histogram(binwidth = 0.2) + labs(title = "Distribution of Count against Sale Price")
ggplot(data = salesofbooks, aes(x = avg_review, fill = sold_by)) + geom_histogram(binwidth = 0.1) + labs(title = "Distribution of Avg Review against number of Books of seller categories")
## Checks the distribution of average reviews across the seller categories to see how skewed it is
ggplot(data = salesofbooks, aes(x = total_reviews)) + geom_histogram(binwidth = 2)
## Shows the distribution of total reviews; there are 24 zero values, which are visible in the graph
summarytable <- salesofbooks %>% group_by(genre) %>% summarise_at(vars(daily_sales),
list(mean_sales = mean))
ggplot(summarytable, aes(y = genre, x = mean_sales)) +
geom_col()
## Mean daily sales by genre, to assess the relationship between the two
#ggplot(salesofbooks, aes(x= avg_review, y = daily_sales)) + geom_point() + geom_smooth() + labs(title = "Relation of Daily Sales against Average Review")
## Plotting ggplot and finding out that there are 0 Average Review values
##filtering avg reviews values which are equal to zero
p <- salesofbooks %>%
filter(avg_review != 0)
ggplot(p, aes(x= avg_review, y = daily_sales)) + geom_point() + geom_smooth() + labs(title = "Daily Sales against Average Review", subtitle = "With 0 avg reviews removed")
##Added a filter and made a ggplot of avg_review against daily_sales with 0 reviews removed
cor(salesofbooks$avg_review,salesofbooks$daily_sales, method = "spearman" )
## [1] 0.004354448
##ggplot(salesofbooks, aes(x= total_reviews, y = daily_sales)) + geom_point() + geom_smooth(method = lm) + labs(title = "Daily Sales against Total Reviews")
ggplot(p, aes(x= total_reviews, y = daily_sales)) + geom_point() + geom_smooth(method = lm) + ylim(0, 270) + labs(title = "Daily Sales against Total Reviews", subtitle = "with 0 reviews removed")
##Added a filter and made a ggplot of total_reviews against daily_sales with 0 reviews removed
cor(salesofbooks$total_reviews,salesofbooks$daily_sales, method = "spearman" )
## [1] 0.678407
ggplot(p, aes(x= avg_review, y = total_reviews)) + geom_point() + geom_smooth(method = lm) + ylim(0, 260) + labs(title = "Total Reviews against Average Review")
rcorr(as.matrix(salesofbooks %>% select(avg_review, daily_sales, total_reviews, sale_price)))
## avg_review daily_sales total_reviews sale_price
## avg_review 1.00 0.00 0.10 -0.02
## daily_sales 0.00 1.00 0.66 -0.28
## total_reviews 0.10 0.66 1.00 -0.26
## sale_price -0.02 -0.28 -0.26 1.00
##
## n= 6000
##
##
## P
## avg_review daily_sales total_reviews sale_price
## avg_review 0.7474 0.0000 0.2347
## daily_sales 0.7474 0.0000 0.0000
## total_reviews 0.0000 0.0000 0.0000
## sale_price 0.2347 0.0000 0.0000
ggplot(salesofbooks, aes(x = daily_sales)) + geom_boxplot() + labs(x="daily sales", y="frequency") + facet_grid(rows = vars(genre)) # one boxplot panel per genre
m.daily.sales.by.genre <- lm(daily_sales ~ genre, data = salesofbooks)
(m.daily.sales.by.genre.emm <- emmeans(m.daily.sales.by.genre, ~genre))
## genre emmean SE df lower.CL upper.CL
## childrens 55.6 0.497 5997 54.6 56.6
## fiction 105.9 0.497 5997 104.9 106.9
## non_fiction 75.9 0.497 5997 74.9 76.8
##
## Confidence level used: 0.95
summary(m.daily.sales.by.genre)
##
## Call:
## lm(formula = daily_sales ~ genre, data = salesofbooks)
##
## Residuals:
## Min 1Q Median 3Q Max
## -102.396 -13.326 -0.076 13.249 102.094
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 55.5773 0.4973 111.76 <2e-16 ***
## genrefiction 50.3087 0.7033 71.53 <2e-16 ***
## genrenon_fiction 20.2886 0.7033 28.85 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 22.24 on 5997 degrees of freedom
## Multiple R-squared: 0.4635, Adjusted R-squared: 0.4633
## F-statistic: 2590 on 2 and 5997 DF, p-value: < 2.2e-16
m.sales_review_A <- lm(daily_sales ~ avg_review + total_reviews, data = salesofbooks)
summary(m.sales_review_A)
##
## Call:
## lm(formula = daily_sales ~ avg_review + total_reviews, data = salesofbooks)
##
## Residuals:
## Min 1Q Median 3Q Max
## -103.396 -14.645 -1.059 13.690 122.429
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 23.870506 2.341271 10.196 < 2e-16 ***
## avg_review -3.943548 0.513120 -7.685 1.77e-14 ***
## total_reviews 0.543329 0.007823 69.451 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 22.6 on 5997 degrees of freedom
## Multiple R-squared: 0.4458, Adjusted R-squared: 0.4456
## F-statistic: 2412 on 2 and 5997 DF, p-value: < 2.2e-16
cbind(coef(m.sales_review_A), confint(m.sales_review_A))
## 2.5 % 97.5 %
## (Intercept) 23.870506 19.2807719 28.4602399
## avg_review -3.943548 -4.9494480 -2.9376473
## total_reviews 0.543329 0.5279926 0.5586653
m.sales_review_B <- lm(daily_sales ~ avg_review * total_reviews, data = salesofbooks)
summary(m.sales_review_B)
##
## Call:
## lm(formula = daily_sales ~ avg_review * total_reviews, data = salesofbooks)
##
## Residuals:
## Min 1Q Median 3Q Max
## -104.08 -14.63 -0.92 13.82 92.33
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 63.546900 4.178047 15.210 < 2e-16 ***
## avg_review -13.683765 0.993159 -13.778 < 2e-16 ***
## total_reviews 0.164754 0.034068 4.836 1.36e-06 ***
## avg_review:total_reviews 0.091688 0.008035 11.411 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 22.36 on 5996 degrees of freedom
## Multiple R-squared: 0.4576, Adjusted R-squared: 0.4573
## F-statistic: 1686 on 3 and 5996 DF, p-value: < 2.2e-16
cbind(coef(m.sales_review_B), confint(m.sales_review_B))
## 2.5 % 97.5 %
## (Intercept) 63.54690004 55.35642562 71.7373745
## avg_review -13.68376484 -15.63071313 -11.7368165
## total_reviews 0.16475390 0.09796872 0.2315391
## avg_review:total_reviews 0.09168842 0.07593650 0.1074403
anova(m.sales_review_A, m.sales_review_B)
## Analysis of Variance Table
##
## Model 1: daily_sales ~ avg_review + total_reviews
## Model 2: daily_sales ~ avg_review * total_reviews
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 5997 3064100
## 2 5996 2998976 1 65125 130.21 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
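The F statistic in the comparison above can be reproduced by hand from the two residual sums of squares, which may help in seeing what the test measures. The sketch below uses only the numbers printed in the table; the variable names are illustrative, and small differences from the printed output come from rounding.

```r
# Values copied from the anova() table above (rounded as printed)
rss_additive    <- 3064100   # Model 1: avg_review + total_reviews
rss_interaction <- 2998976   # Model 2: avg_review * total_reviews
df_extra        <- 1         # one extra parameter (the interaction term)
df_residual     <- 5996      # residual df of the larger model

# F = (drop in RSS per extra parameter) / (residual mean square of larger model)
f_stat <- ((rss_additive - rss_interaction) / df_extra) /
  (rss_interaction / df_residual)

# Upper-tail probability of the F distribution gives the p-value
p_value <- pf(f_stat, df_extra, df_residual, lower.tail = FALSE)

round(f_stat, 2)
p_value
```

A large F with a very small p-value indicates that the interaction term reduces the unexplained variation by more than would be expected by chance.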
ggplot( data = salesofbooks, aes(x= daily_sales, y= avg_review)) + geom_point() +geom_smooth(method = "lm") #Plotting a scatter plot to know the relationship between sales and average review
ggplot( data = salesofbooks, aes(x= daily_sales, y= total_reviews)) + geom_point() +geom_smooth(method = "lm") #Plotting a scatter plot to know the relationship between sales and total review
ggplot(salesofbooks, aes(x= avg_review, y = total_reviews)) + geom_point() + geom_smooth(method = lm)
m.sales_price_for_genre_A <- lm(daily_sales ~ sale_price + genre, data = salesofbooks)
summary(m.sales_price_for_genre_A)
##
## Call:
## lm(formula = daily_sales ~ sale_price + genre, data = salesofbooks)
##
## Residuals:
## Min 1Q Median 3Q Max
## -102.357 -13.311 0.031 13.097 102.924
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 63.8931 1.5195 42.05 < 2e-16 ***
## sale_price -0.8324 0.1438 -5.79 7.4e-09 ***
## genrefiction 48.6713 0.7562 64.36 < 2e-16 ***
## genrenon_fiction 18.5587 0.7624 24.34 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 22.18 on 5996 degrees of freedom
## Multiple R-squared: 0.4665, Adjusted R-squared: 0.4662
## F-statistic: 1748 on 3 and 5996 DF, p-value: < 2.2e-16
cbind(coef(m.sales_price_for_genre_A), confint(m.sales_price_for_genre_A))
## 2.5 % 97.5 %
## (Intercept) 63.8930553 60.914300 66.8718111
## sale_price -0.8324344 -1.114286 -0.5505827
## genrefiction 48.6713347 47.188823 50.1538467
## genrenon_fiction 18.5587230 17.064212 20.0532344
m.sales_price_for_genre_B <- lm(daily_sales ~ sale_price * genre, data = salesofbooks)
summary(m.sales_price_for_genre_B)
##
## Call:
## lm(formula = daily_sales ~ sale_price * genre, data = salesofbooks)
##
## Residuals:
## Min 1Q Median 3Q Max
## -102.38 -13.37 0.03 13.08 102.37
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 72.8781 2.5025 29.122 < 2e-16 ***
## sale_price -1.7319 0.2456 -7.053 1.95e-12 ***
## genrefiction 35.1993 3.2740 10.751 < 2e-16 ***
## genrenon_fiction 6.5492 3.2040 2.044 0.040989 *
## sale_price:genrefiction 1.4587 0.3546 4.114 3.94e-05 ***
## sale_price:genrenon_fiction 1.2817 0.3469 3.695 0.000222 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 22.15 on 5994 degrees of freedom
## Multiple R-squared: 0.4683, Adjusted R-squared: 0.4679
## F-statistic: 1056 on 5 and 5994 DF, p-value: < 2.2e-16
cbind(coef(m.sales_price_for_genre_B), confint(m.sales_price_for_genre_B))
## 2.5 % 97.5 %
## (Intercept) 72.878117 67.9722348 77.784000
## sale_price -1.731864 -2.2132471 -1.250482
## genrefiction 35.199273 28.7810791 41.617467
## genrenon_fiction 6.549246 0.2682356 12.830257
## sale_price:genrefiction 1.458709 0.7636155 2.153802
## sale_price:genrenon_fiction 1.281703 0.6016749 1.961731
anova(m.sales_price_for_genre_A, m.sales_price_for_genre_B)
## Analysis of Variance Table
##
## Model 1: daily_sales ~ sale_price + genre
## Model 2: daily_sales ~ sale_price * genre
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 5996 2949642
## 2 5994 2939524 2 10118 10.316 3.37e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
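Because the interaction model has a separate price slope for each genre, it can help to add the coefficients together. The sketch below uses the estimates printed for `m.sales_price_for_genre_B` above; the vector names are illustrative, and the childrens genre is the reference level.

```r
# Estimates copied from cbind(coef(...), confint(...)) above
coefs <- c(
  "sale_price"                  = -1.731864,
  "sale_price:genrefiction"     =  1.458709,
  "sale_price:genrenon_fiction" =  1.281703
)

# Slope of daily_sales on sale_price within each genre:
# reference slope plus the genre-specific interaction term
slopes <- c(
  childrens   = unname(coefs["sale_price"]),
  fiction     = unname(coefs["sale_price"] + coefs["sale_price:genrefiction"]),
  non_fiction = unname(coefs["sale_price"] + coefs["sale_price:genrenon_fiction"])
)

round(slopes, 2)
```

All three slopes are negative, but the slope for childrens books is the steepest, so a price increase is associated with a larger drop in daily sales for that genre than for fiction or non-fiction.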
vif(m.sales_price_for_genre_A)
## GVIF Df GVIF^(1/(2*Df))
## sale_price 1.229697 1 1.108917
## genre 1.229697 2 1.053051
vif(m.sales_price_for_genre_B, type = 'predictor')
## GVIFs computed for predictors
## GVIF Df GVIF^(1/(2*Df)) Interacts With Other Predictors
## sale_price 1 5 1 genre --
## genre 1 5 1 sale_price --
salesbytotalreview <- lm(daily_sales ~ total_reviews, data = salesofbooks)
summary(salesbytotalreview)
##
## Call:
## lm(formula = daily_sales ~ total_reviews, data = salesofbooks)
##
## Residuals:
## Min 1Q Median 3Q Max
## -103.202 -14.824 -1.026 13.620 138.424
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.875622 1.077639 7.308 3.06e-13 ***
## total_reviews 0.537048 0.007818 68.694 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 22.71 on 5998 degrees of freedom
## Multiple R-squared: 0.4403, Adjusted R-squared: 0.4402
## F-statistic: 4719 on 1 and 5998 DF, p-value: < 2.2e-16
(salesofbooks <- salesofbooks %>% mutate(sales.hat=predict(salesbytotalreview)))
## # A tibble: 6,000 × 8
## sold_by publi…¹ genre avg_r…² daily…³ total…⁴ sale_…⁵ sales…⁶
## <chr> <chr> <fct> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Random House LLC big fi… chil… 4.44 61.6 92 8.03 57.3
## 2 Amazon Digital Service… indie non_… 4.19 74.9 130 9.08 77.7
## 3 Amazon Digital Service… small/… non_… 3.71 66.0 118 9.48 71.2
## 4 Amazon Digital Service… small/… fict… 4.72 85.2 179 12.3 104.
## 5 Simon and Schuster Dig… big fi… chil… 4.65 37.7 111 5.78 67.5
## 6 Simon and Schuster Dig… big fi… chil… 4.81 70.6 106 11.7 64.8
## 7 Amazon Digital Service… small/… fict… 4.33 172. 205 10.3 118.
## 8 HarperCollins Publishe… big fi… chil… 4.21 59.4 86 11.4 54.1
## 9 Amazon Digital Service… small/… fict… 3.95 134. 161 7.08 94.3
## 10 Amazon Digital Service… small/… chil… 4.66 62.2 81 10.8 51.4
## # … with 5,990 more rows, and abbreviated variable names ¹publisher_type,
## # ²avg_review, ³daily_sales, ⁴total_reviews, ⁵sale_price, ⁶sales.hat
ggplot(salesofbooks, mapping = aes(y=daily_sales, x=total_reviews, ymin=daily_sales, ymax=sales.hat)) +
  geom_point() +
  geom_linerange() + # draws the residual from each point to its fitted value, as the subtitle states
  labs(x="total reviews", y="daily sales", title = "Total Reviews against Daily Sales", subtitle="Vertical lines show the residuals") +
  geom_smooth(method=lm)
##Plotted for understanding as to what is the relationship between total reviews and daily sales, and we can conclude that there is a positive relationship, as total reviews increase, daily sales will increase as well.
salesbyavgreview <- lm(daily_sales ~ avg_review, data = salesofbooks)
summary(salesbyavgreview)
##
## Call:
## lm(formula = daily_sales ~ avg_review, data = salesofbooks)
##
## Residuals:
## Min 1Q Median 3Q Max
## -79.944 -22.299 -4.837 18.943 128.948
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 80.0517 2.9510 27.127 <2e-16 ***
## avg_review -0.2208 0.6854 -0.322 0.747
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 30.36 on 5998 degrees of freedom
## Multiple R-squared: 1.729e-05, Adjusted R-squared: -0.0001494
## F-statistic: 0.1037 on 1 and 5998 DF, p-value: 0.7474
salesbyavgreviewtotalreviews <- lm(daily_sales ~ avg_review + total_reviews, data= salesofbooks)
anova(salesbyavgreviewtotalreviews)
## Analysis of Variance Table
##
## Response: daily_sales
## Df Sum Sq Mean Sq F value Pr(>F)
## avg_review 1 96 96 0.1871 0.6653
## total_reviews 1 2464464 2464464 4823.4038 <2e-16 ***
## Residuals 5997 3064100 511
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
children <- salesofbooks %>% filter(genre == 'childrens')
fiction <- salesofbooks %>% filter(genre == 'fiction')
nonfiction <- salesofbooks %>% filter(genre == 'non_fiction')
ggplot(data = nonfiction, aes(x = sale_price, y = daily_sales)) + geom_point() + geom_smooth(method = lm) + labs(title = "Daily Sales Against Sale Price")
salesbysalepricegenre <- lm(daily_sales ~ sale_price * genre, data=salesofbooks)
anova(salesbysalepricegenre)
## Analysis of Variance Table
##
## Response: daily_sales
## Df Sum Sq Mean Sq F value Pr(>F)
## sale_price 1 426054 426054 868.770 < 2.2e-16 ***
## genre 2 2152964 1076482 2195.060 < 2.2e-16 ***
## sale_price:genre 2 10118 5059 10.316 3.37e-05 ***
## Residuals 5994 2939524 490
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
From the above figure we can clearly state that the “fiction” genre has almost double the average daily sales of the “childrens” genre, and we can conclude that different genres have different average sales.
The relationship between daily sales and average review cannot be properly established here, so we cannot conclude much from it. We therefore also consider the relationship between total reviews and daily sales, and the relationship between average review and total reviews.
Looking at the three scatter plots above, it is hard to reach a conclusion from daily sales against total reviews and daily sales against average review alone, so we have also plotted average review against total reviews. We can see that as average review and total reviews increase together, daily sales increase as well, so we can say that as the total number of reviews increases, daily sales increase too. This is supported by a statistical measure, the correlation, which here is 0.68. As per our model with both predictors (m.sales_review_A), every additional review is associated with an increase of around 0.54 in daily sales, and every increase of 1 in the average review is associated with a decrease of around 3.94 in daily sales.
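To make the interpretation of the coefficients concrete, the sketch below turns the estimates printed earlier for `m.sales_review_A` into a small prediction function. The function name `predict_sales` and the example inputs are hypothetical and used only for illustration.

```r
# Coefficients copied from the summary of m.sales_review_A shown earlier
predict_sales <- function(avg_review, total_reviews) {
  23.870506 - 3.943548 * avg_review + 0.543329 * total_reviews
}

# Predicted daily sales for a book with an average review of 4 and 100 reviews
base <- predict_sales(avg_review = 4, total_reviews = 100)

# One extra review, holding average review fixed
effect_of_one_review <- predict_sales(4, 101) - base

# One extra point of average review, holding total reviews fixed
effect_of_one_point <- predict_sales(5, 100) - base

c(base = base, one_review = effect_of_one_review, one_point = effect_of_one_point)
```

Each difference equals the corresponding coefficient, which is what "holding the other variable fixed" means in an additive model.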
Here we can say that as the sale price increases, daily sales decrease slightly, so we can conclude that there is a negative relationship between sale price and daily sales.